Abstract. Upcoming and future astronomy research facilities will systematically 
generate terabyte-sized data sets moving astronomy into the Petascale data era. While 
such facilities will provide astronomers with unprecedented levels of accuracy and cov- 
erage, the increases in dataset size and dimensionality will pose serious computational 
challenges for many current astronomy data analysis and visualization tools. With such 
data sizes, even simple data analysis tasks (e.g. calculating a histogram or computing 
data minimum/maximum) may not be achievable without access to a supercomputing 
facility. 

To effectively handle such dataset sizes, which exceed today's single machine 
memory and processing limits, we present a framework that exploits the distributed 
power of GPUs and many-core CPUs, with the goal of providing data analysis and 
visualization tasks as a service for astronomers. By mixing shared and distributed 
memory architectures, our framework effectively utilizes the underlying hardware 
infrastructure to handle both batched and real-time data analysis and visualization 
tasks. Offering such functionality in a "software as a service" manner will reduce 
the total cost of ownership, provide an easy-to-use tool to the wider astronomical 
community, and enable better utilization of the underlying hardware infrastructure. 



1. Introduction 

Since they were first introduced for general purpose computing, graphics processing 
units (GPUs) have become a science-enabling technology across a wide variety of 
scientific fields [e.g. bioinformatics (Schatz et al. 2007) and weather forecasting 
(Michalakes & Vachharajani 2008)]. The lower cost per floating point operation, the 
low power consumption, and the sustainable speedup are all motivations to utilize GPUs 
as a practical high performance computing architecture, despite being somewhat harder 
to program than CPUs. Within the astronomical community, astronomers have adopted 
GPUs to approach many data processing and simulation problems [see Fluke (2012)]. 

With energy and power consumption a major obstacle to further performance 
increases in current multi-core CPU computing architectures (Bergman et al. 2008), 
it is anticipated that GPUs and other many-core architectures (e.g. field-programmable 
gate arrays and Cell processors) will be one of the main ways to address expected 
petascale data analysis and visualization problems. With datasets exceeding current 
single machine memory limits, and currently relatively low GPU memory (e.g. 6 GB^1), 
it is vital to effectively address the problem of data handling and synchronization 
over heterogeneous distributed CPU/GPU architectures. 

A. H. Hassan, C. J. Fluke, and D. G. Barnes 

In this work, we present a general purpose framework that effectively utilizes 
heterogeneous multi-core CPUs and GPUs to address data intensive high performance 
computing problems in astronomy. 

2. Distributed GPU architecture 

Figure 1 shows the main framework components. Each GPU device within a node 
is managed through a CPU core, which is responsible for preparing the input data, 
invoking the GPU kernel in a synchronous manner, performing any necessary 
pre/post-processing, and sharing the data with the other threads. Each thread works 
as an independent process with two-way communication with the master thread, which 
handles the communication between different threads (if needed) and the communication 
with the other nodes in a master-slave pattern. The communication between the master 
thread and the other threads is performed asynchronously using a custom message 
queue at each thread. Different threads can access a shared memory space, allocated 
by the master thread, to share data and/or update their status, which is used by the 
scheduler sub-module for task allocation. This access is controlled via one or more 
semaphores to ensure exclusive memory writes. Recent GPU drivers have started to 
support a unified address space between GPU and CPU memory (e.g. NVIDIA CUDA 4.0^2), 
which can be utilized in this case as long as control over concurrent access to this 
shared memory is minimal or not required^3. Another hardware feature that may help 
speed up data movement between different levels of the memory hierarchy is the use 
of multiple execution queues (or streams) to overlap GPU computation with data I/O. 
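
The thread layout described above can be illustrated with a minimal Python sketch. It is not the framework's implementation: the class and function names are hypothetical, the synchronous GPU kernel call is simulated by a plain `sum`, and the round-robin task assignment is an assumption standing in for the scheduler sub-module. It only shows the pattern of one message queue per thread, asynchronous dispatch by the master, and a semaphore guarding writes to shared status memory.

```python
import queue
import threading


class Worker(threading.Thread):
    """One CPU thread that would manage a single GPU (GPU work is simulated)."""

    def __init__(self, worker_id, outbox, status, status_lock):
        super().__init__()
        self.worker_id = worker_id
        self.inbox = queue.Queue()      # per-thread message queue (master -> worker)
        self.outbox = outbox            # reply queue (worker -> master)
        self.status = status            # shared memory allocated by the master
        self.status_lock = status_lock  # semaphore guarding writes to `status`

    def run(self):
        while True:
            task = self.inbox.get()
            if task is None:            # sentinel: the master asks the worker to stop
                break
            result = sum(task)          # stand-in for a synchronous GPU kernel call
            with self.status_lock:      # exclusive write to the shared status table
                self.status[self.worker_id] += 1
            self.outbox.put((self.worker_id, result))


def run_master(chunks, num_workers=2):
    """Dispatch chunks to workers and collect (worker_id, result) pairs."""
    status = [0] * num_workers          # completed-task counter per worker
    status_lock = threading.Semaphore(1)
    outbox = queue.Queue()
    workers = [Worker(i, outbox, status, status_lock) for i in range(num_workers)]
    for worker in workers:
        worker.start()
    # Asynchronous dispatch: tasks are enqueued without waiting for replies.
    # Round-robin assignment is an assumption made for this sketch.
    for i, chunk in enumerate(chunks):
        workers[i % num_workers].inbox.put(chunk)
    for worker in workers:
        worker.inbox.put(None)
    for worker in workers:
        worker.join()
    results = [outbox.get() for _ in range(len(chunks))]
    return sorted(results), status
```

In the real framework the master thread would additionally relay results to other nodes; here it simply returns them together with the shared status table.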

All communication between distributed nodes is performed through the master 
threads only. Data scattering and gathering operations are performed in two stages: 
a local stage between GPUs and CPUs using shared memory, and a global stage over 
the network using the message passing interface (MPI^4) protocol. This partitioning, 
as long as it suits the problem, reduces the amount of communication by a factor of 
N, where N is the number of GPU units per node. 
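
The factor-of-N reduction can be made concrete with a small Python sketch. It simulates the two gathering stages in a single process rather than using MPI. The function names are hypothetical, and the message count is idealized as one network message per contribution reaching the root.

```python
def flat_gather_messages(num_nodes, gpus_per_node):
    """Network messages if every GPU thread sent its own result to the root."""
    return num_nodes * gpus_per_node


def two_stage_gather(per_gpu_values, combine=sum):
    """Gather values in two stages.

    per_gpu_values holds one list per node, with one value per GPU on that node.
    Returns the combined result and the number of network messages needed.
    """
    # Stage 1 (local): GPUs on a node combine through shared memory,
    # which costs no network traffic.
    node_partials = [combine(values) for values in per_gpu_values]
    # Stage 2 (global): each master thread sends one message over the network.
    network_messages = len(node_partials)
    return combine(node_partials), network_messages
```

For the configuration used in this work (64 nodes with 2 GPUs each), the flat scheme would need 128 messages and the two-stage scheme 64, a reduction by N = 2. This only works when the operation can be combined hierarchically, which is the "as long as it suits the problem" condition above.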

To demonstrate the performance of the presented framework, we use interactive 
volume rendering of larger-than-memory spectral data cubes as a case study [see 
Hassan et al. (2011) for problem description and motivations]. With data exceeding 
single machine memory limits, real-time processing demands, and relatively high 
communication overhead that scales linearly with the number of processing elements, 
volume rendering (and interactive visualization in general) presents one of the 
worst case performance demands. It is therefore a good example to demonstrate the 
power of mixing shared and distributed memory to achieve the highest possible 
performance. 



^1 http://www.nvidia.com/object/personal-supercomputing.html 
^2 http://www.nvidia.com/object/cuda_home_new.html 
^3 Atomic operations and concurrent access prevention usually degrade the GPU performance significantly. 
^4 See http://www.mcs.anl.gov/research/projects/mpi/ for details. 
Figure 1. Schematic diagram showing the main components of the framework. 
The framework is utilized to synchronize the communication between K distributed 
nodes with N GPUs each. 



The presented framework is developed as a server-side rendering back-end, with a 
remote visualization Qt^5 desktop viewer to enable user interactivity and result 
display. The CUDA driver API was used to implement the GPU part, with MPI as the 
main communication backbone between nodes. 

Table 1 shows the framework performance [in frames per second (fps)] against the 
dataset size in GB for a fixed number of GPUs (128 GPUs) and processing nodes (64 
nodes with 2 GPUs each). The maximum achieved throughput is 2.5 teravoxels per 
second. The amount of data exchanged is in principle related to the output 
resolution, which is of the order of a megapixel per GPU. Owing to several 
communication optimizations and the two-level gathering described above, the amount 
of data exchanged is reduced by at least 50% (Hassan et al. 2011). The main 
distributed communication pattern was master-slave communication with no data 
compression. 



Table 1. Performance output of the larger-than-memory volume rendering problem 
with different datasets ranging from 4 to 204 GB cubes over 128 GPUs and 64 
nodes (2 GPUs per node). 

Dimensions            File Size   Tesla C1060                   Tesla C2050 
(Data Points)                     (240 cores and 4GB memory)    (448 cores and 3GB memory) 
1024 x 1024 x 1024    4 GB        45 fps                        52 fps 
2502 x 2501 x 1093    26 GB       41 fps                        52 fps 
2600 x 2600 x 2600    65 GB       38 fps                        55 fps 
5004 x 5002 x 2186    204 GB      33 fps                        50 fps 


^5 http://qt.nokia.com/products/ 
3. Discussion 

The presented framework aims mainly to address the design and processing constraints 
of real-time problems, or problems which need different processing elements to 
communicate and exchange data in order to produce the final output (e.g. global view 
data visualization or calculating the data median). This framework can address the 
data exchange and synchronization demands of different data analysis tasks for 
datasets exceeding current single machine memory limits, especially when in-situ 
data analysis is required to minimize I/O overhead. Take, for example, an expected 
Australian Square Kilometre Array Pathfinder (ASKAP) spectral data cube (around 
1 TB): processing it on a single GPU would require partitioning the cube into 170 
sub-cubes, each needing a separate data load. This might be feasible for a 
single-pass accumulative operation like calculating the data minimum and maximum, 
but does not work for multiple-pass problems such as calculating the median or 
standard deviation. More sophisticated data analysis tasks usually require the whole 
dataset to be in memory to measure global properties, and that is where our 
framework is most useful. 
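
The difference between the two kinds of operation can be shown with a short Python sketch. The function names are hypothetical, and the sub-cube estimate assumes 1 TB = 1024 GB together with the 6 GB of GPU memory quoted in the Introduction. The sketch streams sub-cubes one at a time for minimum/maximum, and shows why a median cannot simply be assembled from per-sub-cube medians.

```python
import math
import statistics


def sub_cube_count(cube_gb, gpu_memory_gb):
    """Number of sub-cubes needed so that each one fits in GPU memory."""
    return math.ceil(cube_gb / gpu_memory_gb)


def streaming_min_max(chunks):
    """Single-pass accumulation: only one chunk is held in memory at a time."""
    lo, hi = math.inf, -math.inf
    for chunk in chunks:
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
    return lo, hi


def median_of_chunk_medians(chunks):
    """Naive per-chunk combination, which is NOT the global median in general."""
    return statistics.median(statistics.median(chunk) for chunk in chunks)
```

With the toy chunks `[[1, 2, 100], [3, 4, 5]]`, `streaming_min_max` returns the correct global extremes `(1, 100)`. The median of the per-chunk medians is 3.0, while the true median of all six values is 3.5. Under the stated assumptions, `sub_cube_count(1024, 6)` gives 171, in line with the roughly 170 sub-cubes quoted above.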

Addressing data analysis and visualization processes for such data volumes will 
require careful resource utilization and minimal data movement to achieve 
reasonable computational performance. We think distributed GPUs can play a key role 
in enabling such tasks with reasonable response times. We showed in Hassan et al. 
(2011) that for a computationally intensive problem like volume rendering, replacing 
CPUs with GPUs as the main processing element can dramatically reduce the number of 
processing nodes required. Consequently, this reduction decreases the communication 
overhead and the size of the computing facility required to address such problems. 

Another aspect is working in a multi-user environment. With such data intensive 
problems we need a configurable, on-demand resource sharing model that can fit our 
future needs [see Ostberg (2011) for a review of different available high 
performance computing resource management models]. We think a private cloud, 
service-oriented architecture may be a better resource sharing paradigm, with 
software, infrastructure and data offered as a service to the user via a remote 
thin-client. Preparing our software to integrate with such a model can offer a large 
scale distributed architecture in a more affordable way, reduce the total cost of 
ownership for both the software tools and infrastructure, and enhance access to 
large datasets. 

Acknowledgments. A. Hassan thanks the Astronomical Society of Australia for 